url = "http://www.guancha.cn/" thepage=readLines(url, encoding = "UTF-8") # important to set utf-8 str(thepage) head(thepage) thepage[6] We can get the current locale in R using Sys.getlocale(). > Sys.getlocale() If I want to change current locale to Chinese, how should I proceed? Well, we can change the locale in Windows Control Panel -> Region and Language -> Format choose the format you want, e.g. “Chinese (Simplified, PRC)”, and apply change. Now, restart R and check the locale information using command Sys.getlocale(). We can see the default locale for R is “Chinese (Simplified)_People’s Republic of China.936”, and the code page ‘936’ is the character encoding “GB18030”. > Sys.getlocale() [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936" What if we want to change the locale temporary in R? For example, we need to set the encoding for traditional Chinese instead of Simplified Chinese, Can we use the command: Sys.setlocale(category = "LC_ALL", locale = "Chinese (Traditional)_People's Republic of China.936")? Yes, you can. But it doesn’t work! Why? That’s because ‘936’ is the encoding for simplified chinese characters not for traditional chinese characters. Then you want a list of available locales to determine the legal locale options. Unlike Linux, we can’t (or have no easy method to) find such list. But you can find the list by searching Windows Language strings on internet. zh-cn Chinese (China) The currently language strings list can be found here. The “Language string” column contains the legal input for setting locale in R. For example, if I want to change current locale to Traditional Chinese, I use command: Sys.setlocale(category = "LC_ALL", locale = "cht"). Sys.setlocale(category = "LC_ALL", locale = "zh-cn") > Sys.setlocale(category = "LC_ALL", locale = "cht") [1] "LC_COLLATE=Chinese (Traditional)_Taiwan.950;LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950;LC_NUMERIC=C;LC_TIME=Chinese (Traditional)_Taiwan.950" Check the current locale setting, we can see it changed to “LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950”, where codepage ‘950’ is encoding Big5. Windows uses only one character encoding for each language, which make the recognization of encoding type easy. If I received a script contains Simplified Chinese characters and generated under Windows OS, the encoding must be GB18030. To show it correctly, I can set or change the locale of R to “chinese”, and import the source code using readLines, or there is a even easy way: using Office Word! If using RStudio as the editor for R, we just need to change the default encoding to “GB18030” in “Tools -> Options -> General -> Default text encoding”. Apply the setting, then open the script, it should be displayed correctly. Hello, nice to see you again! [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936" [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936" Error in tools::startDynamicHelp() : unable to create socket > Sys.getlocale() [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"